Library catalogues and open quantification of knowledge production 1470-1800

Mikko Tolonen, Leo Lahti

June 12, 2015

Open analytical ecosystems for digital humanities

Open science principles

Library catalogues: the data

https://github.com/rOpenGov/estc


Publishing “history” in Britain and North America 1470-1800

Research questions

  1. Who wrote history?
  2. Where was it published?
  3. How does the publishing of history change over the early modern period?

Load the data and tools

Load the data and tools in R:

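load("df.RData")
library(bibliographica)
# Assumed: packages providing kable(), %>%/filter(), ggplot(), melt() and grid.arrange() used on later slides
library(knitr); library(dplyr); library(ggplot2); library(reshape2); library(gridExtra)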

Fill missing dimensions

Estimate missing dimensions

kable(polish_dimensions("10 cm (12⁰)", fill = TRUE))
original      gatherings  width  height  area
10 cm (12⁰)   12to        10     15      150
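
The estimate above can be pictured as a table lookup: given the gatherings and one known dimension, the missing dimension is read from a table of typical sizes, and the area is width times height. The following is a minimal sketch of that idea only, not the bibliographica implementation. The function `fill_height` and the one-row table `size_table` are hypothetical, and the single row simply reuses the values shown on the slide.

```r
# Hypothetical lookup table: typical height (cm) for a gatherings/width pair.
# The single row reuses the example from the slide.
size_table <- data.frame(
  gatherings = "12to",
  width      = 10,
  height     = 15,
  stringsAsFactors = FALSE
)

# Fill a missing height from the lookup table and compute the area.
fill_height <- function(gatherings, width, height = NA, lookup = size_table) {
  if (is.na(height)) {
    hit <- lookup$height[lookup$gatherings == gatherings & lookup$width == width]
    # Leave the height missing when the table has no unique matching row
    height <- if (length(hit) == 1) hit else NA_real_
  }
  data.frame(gatherings = gatherings, width = width,
             height = height, area = width * height,
             stringsAsFactors = FALSE)
}

res <- fill_height("12to", 10)
print(res)
```

A combination that is absent from the table, such as `fill_height("4to", 20)`, keeps a missing height and area instead of guessing.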

Author gender

Enriching data by external information

as.matrix(get_gender(polish_author(sample(unique(df$author.name), 20))$first)$gender)
##           [,1]  
## hugh      "male"
## william   "male"
## daniel    "male"
## edward    "male"
## john      "male"
## william   "male"
## gilbert   "male"
## samuel    "male"
## henry     "male"
## levi      "male"
## zachary   "male"
## alexander "male"
## john      "male"
## anthony   "male"
## philippe  "male"
## david     "male"
## robert    "male"
## edward    "male"
## calybute  NA    
## gilbert   "male"
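
The enrichment step above amounts to matching first names against an external name list. A minimal sketch of such a lookup follows; it is not how `get_gender` is implemented. The table `name_gender` and the function `guess_gender` are hypothetical and far smaller than any real name list.

```r
# Hypothetical first-name -> gender table; a real analysis would use a
# much larger, period-appropriate name list.
name_gender <- c(john = "male", william = "male", mary = "female")

# Look up the gender for each first name (case-insensitive).
# Names missing from the table yield NA.
guess_gender <- function(first_names, mapping = name_gender) {
  unname(mapping[tolower(first_names)])
}

genders <- guess_gender(c("John", "Mary", "Calybute"))
print(genders)
```

Rare names that are absent from the list stay NA, as "calybute" does in the output above.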

1. Who wrote history?

Top-20 authors (number of titles)

top_plot(df, "author.unique", 20)

Who wrote history?

Top-10 female authors (number of titles)

Who wrote history?

Title count vs. paper consumption

Document count vs. paper for top authors

ggplot(df2, aes(x = docs, y = paper)) + geom_text(aes(label = author.unique), size = 4)
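
The slide does not show how `df2` is built for this plot. One plausible construction, modelled on the per-country summary used later in the deck, is sketched below on made-up data. The toy data frame `df_toy` and the result `df2_toy` are hypothetical; only the column names `author.unique` and `paper.consumption.km2` come from the slides.

```r
library(dplyr)

# Toy stand-in for the catalogue data frame (values are made up)
df_toy <- data.frame(
  author.unique = c("A", "A", "B", "C", "C", "C"),
  paper.consumption.km2 = c(0.1, 0.2, 0.05, NA, 0.3, 0.1)
)

# One row per author: total paper consumption and number of documents
df2_toy <- df_toy %>%
  group_by(author.unique) %>%
  summarize(paper = sum(paper.consumption.km2, na.rm = TRUE), docs = n()) %>%
  arrange(desc(docs))

print(df2_toy)
```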

Who wrote history?

Gender distribution of authors over time. Note that name-gender mappings change over time; this has not yet been taken into account.

## 
## female   male 
##  0.037  0.963
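
Proportions of this kind can be obtained by tabulating the gender labels and normalising the counts. The following minimal sketch uses made-up data, so its shares differ from those above; the vector `gender` is hypothetical.

```r
# Toy data (not from the catalogue): one gender label per author,
# with NA for names that could not be mapped.
gender <- c("male", "male", "female", "male", NA, "male")

# table() drops NA by default, so the shares are computed over
# authors with a known gender only.
shares <- round(prop.table(table(gender)), 3)
print(shares)
```

Because unmapped names are dropped before normalising, the shares describe only authors whose gender could be inferred.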

Who wrote history?

Other questions to explore

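# Alternative subsets: each assignment overwrites df2, so run only the one of interest
df2 <- df %>% filter(publication.place == "London")
df2 <- df %>% filter(language == "French")
df2 <- df %>% filter(publication.year >= 1700 & publication.year < 1800)
# Plot the top authors within the chosen subset
top_plot(df2, "author.unique", 10)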

2. Where was history published?

Top-10 places (number of titles)

top_plot(df, "publication.place", 10)

Where was history published?

df2 <- df %>% filter(publication.country %in% c("France", "Germany")) %>%
    group_by(publication.decade, publication.country) %>%
    summarize(paper = sum(paper.consumption.km2, na.rm = TRUE), docs = n()) 
p <- ggplot(df2, aes(x = publication.decade, y = docs, color = publication.country)) +
     geom_point() + geom_smooth()
print(p)     
## Warning in loop_apply(n, do.ply): span too small. fewer data values than
## degrees of freedom.
## Warning in loop_apply(n, do.ply): pseudoinverse used at 1599
## Warning in loop_apply(n, do.ply): neighborhood radius 180.95
## Warning in loop_apply(n, do.ply): reciprocal condition number 0
## Warning in loop_apply(n, do.ply): There are other near singularities as
## well. 29224
## Warning in loop_apply(n, do.ply): NaNs produced
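
The slides do not show how `publication.decade` is derived from the publication year. One common approach, assumed here rather than taken from the source, is to round each year down to the start of its decade. The years below are made up.

```r
# Toy publication years (made up)
publication.year <- c(1470, 1599, 1600, 1707, 1799)

# Assumed derivation: round each year down to the start of its decade
publication.decade <- 10 * floor(publication.year / 10)
print(publication.decade)
```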

Where was history published?

Title count vs. paper

publication.place   paper      docs
London              1.8523002  683
Dublin              0.2870011  75
Edinburgh           0.1851683  40
Philadelphia Pa     0.1192282  30
Boston              0.0146536  27
Oxford              0.0827044  22
New York N.Y        0.0031327  12
unknown             0.0100618  9
Norwich             0.0015466  6
Amsterdam           0.0036861  5
Glasgow             0.0161695  5
Bristol             0.0043038  4
Paris               0.0490283  4
Providence R.I      0.0001264  4
Hartford Ct         0.0005302  3
Newport R.I         0.0008837  3
Norwich Ct          0.0019448  3
Watertown Ma        0.0000000  3
Williamsburg Va     0.0006158  3
Aberdeen            0.0128302  2
Boston Ma           0.0049400  2
Cambridge           0.0011969  2
Chester             0.0008364  2
New London Ct       0.0001102  2
Newburyport Ma      0.0004750  2
Newcastle           0.0070300  2
Salem Ma            0.0016016  2
Salisbury           0.0012787  2
St. Omer            0.0017043  2
York                0.0003814  2
Albany N.Y          0.0007392  1
Bennington Vt       0.0022562  1
Birmingham          0.0005225  1
Bombay              0.0008100  1
Bury                0.0002964  1
Calcutta            0.0008645  1
Cambridge Ma        0.0002025  1
Canterbury          0.0008100  1
Carmarthen          0.0019950  1
Charleston S.C      0.0000640  1
Cirencester         0.0008100  1
Cork                0.0015808  1
Coventry            0.0001350  1
Dunbar              0.0002850  1
Evesham             0.0117040  1
Exeter              0.0001350  1
Geneva              0.0080769  1
Gouda               0.0004928  1
Harrisburgh Pa      0.0007362  1
Hillsborough N.C    0.0001350  1
Hull                0.0008100  1
Leeds               0.0043225  1
Limerick            0.0063726  1
Litchfield Ct       0.0000693  1
Maidstone           0.0004928  1
New Bern N.C        0.0014168  1
Perth               0.0003696  1
Portsmouth N.H      0.0004940  1
Quebec              0.0008768  1
Twickenham          0.0001232  1
Vienna              0.0000000  1
Waltham             0.0001350  1
Washington D.C      0.0002268  1
Westminster Vt      0.0000540  1
Wilmington De       0.0000693  1
Winchester          0.0008100  1
Worcester           0.0091784  1

ggplot(df2,
     aes(x = log10(1 + docs), y = log10(1 + paper))) +
     geom_text(aes(label = publication.place), size = 3) +
     scale_x_log10() + scale_y_log10()

Where was history published?

Scotland, Ireland, US comparison:

df2 <- df %>%
    filter(!is.na(publication.country)) %>%
    group_by(publication.country) %>%
    summarize(paper = sum(paper.consumption.km2, na.rm = TRUE),
          docs = n()) %>%
    arrange(desc(docs)) %>%
    filter(publication.country %in% c("Scotland", "Ireland", "USA"))

Where was history published?

p1 <- ggplot(subset(melt(df2), variable == "paper"), aes(y = value, x = publication.country)) + geom_bar(stat = "identity") + ylab("Paper consumption")
p2 <- ggplot(subset(melt(df2), variable == "docs"), aes(y = value, x = publication.country)) + geom_bar(stat = "identity") + ylab("Title count")
grid.arrange(p1, p2, nrow = 1)

3. How does the publishing of history change over the early modern period?

What can we say about the nature of the documents? Pamphlets (<32 pages) vs. books (>120 pages)? Book size statistics and their development over time.
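
The pamphlet/book split can be expressed as a simple rule on page counts. The sketch below applies the two thresholds named above to made-up page counts. The vector name `pagecount` and the middle category "other" are assumptions, since the slides do not say how documents of 32 to 120 pages are treated.

```r
# Toy page counts (made up); `pagecount` is a hypothetical column name
pagecount <- c(8, 31, 32, 120, 121, 400, NA)

# Under 32 pages = pamphlet, over 120 pages = book;
# anything in between goes to an assumed middle category.
doc_type <- ifelse(pagecount < 32, "pamphlet",
            ifelse(pagecount > 120, "book", "other"))

# Count documents per category, keeping unknown page counts visible
print(table(doc_type, useNA = "ifany"))
```

Documents with a missing page count stay NA rather than being forced into a category.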

Nature of the documents

Estimated paper consumption by document size

Nature of the documents

Document sizes over time

Serious statistical analysis (also in the Humanities)

Open science in (digital?) humanities

Barriers to open science in the humanities
